{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install matgraphdb\n", "!pip install ipykernel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 01 - Introduction to MatGraphDB\n", "\n", "This notebook demonstrates the basic usage of MatGraphDB, a graph database designed for materials science data. MatGraphDB allows you to:\n", "\n", "1. Store and query materials data\n", "2. Create relationships between different types of materials data\n", "3. Generate and manage derived data through generators\n", "\n", "Let's go through some basic examples." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import shutil\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Creating a MatGraphDB Instance\n", "\n", "The main class in MatGraphDB is the `MatGraphDB` class, which can be directly imported from the `matgraphdb` package. This class is used to create a new MatGraphDB instance. We'll store our data in a temporary directory." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from matgraphdb import MatGraphDB\n", "\n", "# Create a temporary directory for our database\n", "\n", "storage_path = \"MatGraphDB\"\n", "if os.path.exists(storage_path):\n", " shutil.rmtree(storage_path)\n", "\n", "os.makedirs(storage_path, exist_ok=True)\n", "\n", "\n", "# Initialize MatGraphDB\n", "mgdb = MatGraphDB(storage_path=storage_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the database summary" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "GRAPH DATABASE SUMMARY\n", "============================================================\n", "Name: MatGraphDB\n", "Storage path: MatGraphDB\n", "└── Repository structure:\n", " ├── nodes/ (MatGraphDB\\nodes)\n", " ├── edges/ (MatGraphDB\\edges)\n", " ├── edge_generators/ (MatGraphDB\\edge_generators)\n", " ├── node_generators/ (MatGraphDB\\node_generators)\n", " └── graph/ (MatGraphDB\\graph)\n", "\n", "############################################################\n", "NODE DETAILS\n", "############################################################\n", "Total node types: 1\n", "------------------------------------------------------------\n", "• Node type: materials\n", " - Number of nodes: 0\n", " - Number of features: 1\n", " - db_path: MatGraphDB\\nodes\\materials\n", "------------------------------------------------------------\n", "\n", "############################################################\n", "EDGE DETAILS\n", "############################################################\n", "Total edge types: 0\n", "------------------------------------------------------------\n", "\n", "############################################################\n", "NODE GENERATOR DETAILS\n", "############################################################\n", "Total node generators: 0\n", "------------------------------------------------------------\n", "\n", "############################################################\n", "EDGE GENERATOR DETAILS\n", "############################################################\n", "Total edge generators: 0\n", "------------------------------------------------------------\n", "\n" ] } ], "source": [ "print(mgdb.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that a \"materials\" node store has been automatically created. This is because the `MatGraphDB` class automatically creates a node store for the \"materials\" node type." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Adding Materials\n", "\n", "We can create new materials in our \"materials\" store by calling `mgdb.create_material(...)`. This is a wrapper around the `create_material` method of the `MaterialNodes` class, which mean the arugments are the same.\n", "\n", "Materials can be created in `MatGraphDB` in two primary ways through the `create_materials` method:\n", "\n", "1. **Using a `pymatgen` `Structure` object**: \n", " This is the most direct way when you already have a `Structure` object that defines the material's atomic arrangement.\n", "\n", "2. **Providing atomic coordinates, species, and lattice manually**: \n", " This is useful when you have the raw data and want to construct the material without a `Structure` object. \n", "\n", "Additionally, you can include custom properties in the form of a nested dictionary to enrich the material data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 1: Using a `Structure` object" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-10 17:17:41 - matgraphdb.materials.nodes.materials - INFO - Adding a new material.\n", "2025-01-10 17:17:41 - matgraphdb.materials.nodes.materials - INFO - Material added successfully.\n", " atomic_numbers cartesian_coords density \\\n", "0 [12, 8] [[0.0, 0.0, 0.0], [2.13, 2.13, 2.13]] 3.462843 \n", "\n", " density_atomic elements formula frac_coords id \\\n", "0 0.103481 [Mg, O] Mg1 O1 [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]] 0 \n", "\n", " lattice material_id ... \\\n", "0 [0.0, 2.13, 2.13, 2.13, 0.0, 2.13, 2.13, 2.13,... mp-1 ... \n", "\n", " structure.lattice.beta structure.lattice.c structure.lattice.gamma \\\n", "0 60.0 3.012275 60.0 \n", "\n", " structure.lattice.matrix structure.lattice.pbc \\\n", "0 [0.0, 2.13, 2.13, 2.13, 0.0, 2.13, 2.13, 2.13,... [True, True, True] \n", "\n", " structure.lattice.volume structure.sites \\\n", "0 19.327194 [{'abc': [0.0, 0.0, 0.0], 'label': 'Mg', 'prop... \n", "\n", " thermal_conductivity.unit thermal_conductivity.value volume \n", "0 W/mK 2.5 19.327194 \n", "\n", "[1 rows x 30 columns]\n", "(1, 30)\n" ] } ], "source": [ "from pymatgen.core Structure\n", "# Define a pymatgen Structure object\n", "structure = Structure(\n", " lattice=[[0, 2.13, 2.13], [2.13, 0, 2.13], [2.13, 2.13, 0]],\n", " species=[\"Mg\", \"O\"],\n", " coords=[[0, 0, 0], [0.5, 0.5, 0.5]],\n", ")\n", "\n", "material_data = {\n", " \"structure\": structure,\n", " \"properties\": {\n", " \"material_id\": \"mp-1\",\n", " \"source\": \"example\",\n", " \"thermal_conductivity\": {\"value\": 2.5, \"unit\": \"W/mK\"},\n", " },\n", "}\n", "\n", "# Add the material to the database.\n", "mgdb.create_material(\n", " structure=material_data[\"structure\"], properties=material_data[\"properties\"]\n", ")\n", "\n", "material_nodes = mgdb.read_nodes(\"materials\")\n", "print(material_nodes.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 2: Using atomic coordinates, species, and lattice manually" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-10 17:18:51 - matgraphdb.materials.core - INFO - Creating material.\n", "2025-01-10 17:18:51 - matgraphdb.materials.nodes.materials - INFO - Adding a new material.\n", "2025-01-10 17:18:51 - matgraphdb.materials.nodes.materials - INFO - Material added successfully.\n", " atomic_numbers band_gap.unit band_gap.value \\\n", "0 [12, 8] None NaN \n", "1 [12, 8] eV 1.2 \n", "\n", " cartesian_coords density density_atomic elements \\\n", "0 [[0.0, 0.0, 0.0], [2.13, 2.13, 2.13]] 3.462843 0.103481 [Mg, O] \n", "1 [[0.0, 0.0, 0.0], [2.13, 2.13, 2.13]] 3.462843 0.103481 [Mg, O] \n", "\n", " formula frac_coords id ... structure.lattice.beta \\\n", "0 Mg1 O1 [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]] 0 ... 60.0 \n", "1 Mg1 O1 [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]] 1 ... 60.0 \n", "\n", " structure.lattice.c structure.lattice.gamma \\\n", "0 3.012275 60.0 \n", "1 3.012275 60.0 \n", "\n", " structure.lattice.matrix structure.lattice.pbc \\\n", "0 [0.0, 2.13, 2.13, 2.13, 0.0, 2.13, 2.13, 2.13,... [True, True, True] \n", "1 [0.0, 2.13, 2.13, 2.13, 0.0, 2.13, 2.13, 2.13,... [True, True, True] \n", "\n", " structure.lattice.volume structure.sites \\\n", "0 19.327194 [{'abc': [0.0, 0.0, 0.0], 'label': 'Mg', 'prop... \n", "1 19.327194 [{'abc': [0.0, 0.0, 0.0], 'label': 'Mg', 'prop... \n", "\n", " thermal_conductivity.unit thermal_conductivity.value volume \n", "0 W/mK 2.5 19.327194 \n", "1 None NaN 19.327194 \n", "\n", "[2 rows x 32 columns]\n", "(2, 32)\n" ] } ], "source": [ "# Define atomic data\n", "coords = [[0, 0, 0], [0.5, 0.5, 0.5]]\n", "species = [\"Mg\", \"O\"]\n", "lattice = [[0, 2.13, 2.13], [2.13, 0, 2.13], [2.13, 2.13, 0]]\n", "\n", "# Add the material to the database\n", "material_data = {\n", " \"coords\": coords,\n", " \"species\": species,\n", " \"lattice\": lattice,\n", " \"properties\": {\n", " \"material_id\": \"mp-2\",\n", " \"source\": \"example_manual\",\n", " \"band_gap\": {\"value\": 1.2, \"unit\": \"eV\"},\n", " },\n", "}\n", "result = mgdb.create_material(\n", " coords=material_data[\"coords\"],\n", " species=material_data[\"species\"],\n", " lattice=material_data[\"lattice\"],\n", " properties=material_data[\"properties\"],\n", ")\n", "\n", "material_nodes = mgdb.read_nodes(\"materials\")\n", "print(material_nodes.to_pandas())\n", "print(material_nodes.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NodeStores and EdgeStores\n", "\n", "Any nodes or edges in MatGraphDB are stored in **NodeStores** and **EdgeStores**, respectively. These can be accessed through the `node_stores` and `edge_stores` attributes of the `MatGraphDB` class. These stores extend the capabilities of the `ParquetDB` class, allowing users to leverage all the features available in `ParquetDB`. \n", "\n", "#### What is ParquetDB?\n", "\n", "`ParquetDB` is a database framework built on top of **Apache Parquet** and **PyArrow**, optimized for handling large-scale data. Its core strength lies in its ability to efficiently store, query, and process vast amounts of data with minimal overhead. The foundation of `ParquetDB`—Apache Parquet—offers several advantages:\n", "\n", "- **Columnar Format**: Parquet organizes data by columns instead of rows, making it particularly effective for analytical queries that only need specific columns. This format improves compression efficiency and reduces I/O overhead. \n", "- **Schema Embedding**: Each Parquet file includes an embedded schema, ensuring consistency and enabling seamless schema evolution. \n", "- **Predicate Pushdown**: Parquet's structure allows queries to read only the necessary data blocks and relevant columns, significantly improving query performance by reducing data load times. \n", "- **Efficient Encoding and Compression**: Parquet supports column-level encoding and compression, enhancing both storage efficiency and read performance. \n", "- **Metadata Support**: Parquet files store metadata at both the table and field levels, which facilitates efficient querying and rich data descriptions. \n", "- **Batch Processing**: Data in Parquet files is organized into column groups, making it ideal for batch operations and high-throughput workflows.\n", "\n", "#### Why Use ParquetDB?\n", "\n", "By leveraging the advantages of Parquet files, `ParquetDB` efficiently handles the serialization and deserialization of complex datasets. It provides scalable and fast access to data, seamlessly integrating into machine learning workflows and big data pipelines. This is especially beneficial for material science applications in MatGraphDB, where datasets often involve complex and interconnected relationships between nodes and edges.\n", "\n", "#### Integration with MatGraphDB\n", "\n", "When nodes are added to MatGraphDB, they are indexed, and the index is stored in the `id` column. This ensures fast lookups and efficient data management. Whether you are storing materials, elements, or defining relationships between them, NodeStores and EdgeStores inherit all the features of `ParquetDB`, making them robust, scalable, and performant.\n", "\n", "For more details about `ParquetDB`, including its architecture and capabilities, refer to the [ParquetDB documentation](https://github.com/lllangWV/ParquetDB).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding Multiple Materials at Once\n", "\n", "If you have multiple materials, it is more efficient to add them in a bulk call. You can do this by using create_materials. Here's an example:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-10 17:39:19 - matgraphdb.materials.core - INFO - Creating materials.\n", "2025-01-10 17:39:19 - matgraphdb.materials.nodes.materials - INFO - Adding 2 materials to the database.\n", "2025-01-10 17:39:19 - matgraphdb.utils.mp_utils - INFO - Passing the following arguments to the worker method\n", "2025-01-10 17:39:21 - matgraphdb.materials.nodes.materials - ERROR - Error adding material: Unable to merge: Field atomic_numbers has incompatible types: list<element: int64> vs extension<arrow.fixed_shape_tensor[value_type=int64, shape=[2]]>\n", "2025-01-10 17:39:21 - matgraphdb.materials.nodes.materials - INFO - All materials added successfully.\n", " atomic_numbers band_gap.unit band_gap.value \\\n", "0 [12, 8] None NaN \n", "1 [12, 8] eV 1.2 \n", "\n", " cartesian_coords density density_atomic elements \\\n", "0 [[0.0, 0.0, 0.0], [2.13, 2.13, 2.13]] 3.462843 0.103481 [Mg, O] \n", "1 [[0.0, 0.0, 0.0], [2.13, 2.13, 2.13]] 3.462843 0.103481 [Mg, O] \n", "\n", " formula frac_coords id ... structure.lattice.beta \\\n", "0 Mg1 O1 [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]] 0 ... 60.0 \n", "1 Mg1 O1 [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]] 1 ... 60.0 \n", "\n", " structure.lattice.c structure.lattice.gamma \\\n", "0 3.012275 60.0 \n", "1 3.012275 60.0 \n", "\n", " structure.lattice.matrix structure.lattice.pbc \\\n", "0 [0.0, 2.13, 2.13, 2.13, 0.0, 2.13, 2.13, 2.13,... [True, True, True] \n", "1 [0.0, 2.13, 2.13, 2.13, 0.0, 2.13, 2.13, 2.13,... [True, True, True] \n", "\n", " structure.lattice.volume structure.sites \\\n", "0 19.327194 [{'abc': [0.0, 0.0, 0.0], 'label': 'Mg', 'prop... \n", "1 19.327194 [{'abc': [0.0, 0.0, 0.0], 'label': 'Mg', 'prop... \n", "\n", " thermal_conductivity.unit thermal_conductivity.value volume \n", "0 W/mK 2.5 19.327194 \n", "1 None NaN 19.327194 \n", "\n", "[2 rows x 32 columns]\n", "(2, 32)\n" ] } ], "source": [ "materials = [\n", " {\n", " \"structure\": structure,\n", " \"properties\": {\"material_id\": \"mp-3\", \"density\": 3.95},\n", " },\n", " {\n", " \"coords\": [[0, 0, 0], [0.5, 0.5, 0.5]],\n", " \"species\": [\"Si\", \"O\"],\n", " \"lattice\": [[0, 3.1, 3.1], [3.1, 0, 3.1], [3.1, 3.1, 0]],\n", " \"properties\": {\"material_id\": \"mp-4\", \"band_gap\": 1.8},\n", " },\n", "]\n", "\n", "mgdb.create_materials(materials)\n", "\n", "material_nodes = mgdb.read_nodes(\"materials\")\n", "print(material_nodes.to_pandas())\n", "print(material_nodes.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Reading Materials\n", "\n", "In the previous section, we added materials to the database and performed a basic read operation, but lets go into more details for the read operations.\n", "\n", "\n", "To start, we can read all materials in the database with the `read_materials` method." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:09:04 - matgraphdb.materials.core - INFO - Reading materials.\n", "All materials:\n", " atomic_numbers band_gap.unit band_gap.value \\\n", "0 [12, 8] None NaN \n", "1 [12, 8] eV 1.2 \n", "\n", " cartesian_coords density density_atomic elements \\\n", "0 [[0.0, 0.0, 0.0], [2.13, 2.13, 2.13]] 3.462843 0.103481 [Mg, O] \n", "1 [[0.0, 0.0, 0.0], [2.13, 2.13, 2.13]] 3.462843 0.103481 [Mg, O] \n", "\n", " formula frac_coords id ... structure.lattice.beta \\\n", "0 Mg1 O1 [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]] 0 ... 60.0 \n", "1 Mg1 O1 [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]] 1 ... 60.0 \n", "\n", " structure.lattice.c structure.lattice.gamma \\\n", "0 3.012275 60.0 \n", "1 3.012275 60.0 \n", "\n", " structure.lattice.matrix structure.lattice.pbc \\\n", "0 [0.0, 2.13, 2.13, 2.13, 0.0, 2.13, 2.13, 2.13,... [True, True, True] \n", "1 [0.0, 2.13, 2.13, 2.13, 0.0, 2.13, 2.13, 2.13,... [True, True, True] \n", "\n", " structure.lattice.volume structure.sites \\\n", "0 19.327194 [{'abc': [0.0, 0.0, 0.0], 'label': 'Mg', 'prop... \n", "1 19.327194 [{'abc': [0.0, 0.0, 0.0], 'label': 'Mg', 'prop... \n", "\n", " thermal_conductivity.unit thermal_conductivity.value volume \n", "0 W/mK 2.5 19.327194 \n", "1 None NaN 19.327194 \n", "\n", "[2 rows x 32 columns]\n" ] } ], "source": [ "# Read all materials\n", "materials = mgdb.read_materials()\n", "print(\"All materials:\")\n", "print(materials.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reading Specific Columns\n", "\n", "We can also read specific columns from the materials node store. This will only read these columns into memory." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:17:30 - matgraphdb.materials.core - INFO - Reading materials.\n", "\n", "Subset of materials data:\n", " material_id elements\n", "0 mp-1 [Mg, O]\n", "1 mp-2 [Mg, O]\n" ] } ], "source": [ "# Read specific columns\n", "materials_subset = mgdb.read_materials(columns=[\"material_id\", \"elements\"])\n", "print(\"\\nSubset of materials data:\")\n", "print(materials_subset.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reading with Filters\n", "\n", "We can also read materials with filters. This will return a filtered view of the materials node store. This follows the filter syntax of of pyarrow Expressions. More details can be found in the [pyarrow documentation](https://arrow.apache.org/docs/python/compute.html#filtering-by-expressions).\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:09:53 - matgraphdb.materials.core - INFO - Reading materials.\n", "\n", "Materials with band gap > 1.0:\n", " material_id elements band_gap.value\n", "0 mp-2 [Mg, O] 1.2\n" ] } ], "source": [ "# Read materials with filters\n", "import pyarrow.compute as pc\n", "\n", "materials_filtered = mgdb.read_materials(\n", " columns=[\"material_id\", \"elements\", \"band_gap.value\"],\n", " filters=[pc.field(\"band_gap.value\") == 1.2],\n", ")\n", "print(\"\\nMaterials with band gap == 1.2:\")\n", "print(materials_filtered.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Filering by ids\n", "\n", "You can also filter by ids. This will return the materials with the specified ids. This is just adding a filter on the `id` column. `pc.field(\"id\").isin(ids)`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:24:20 - matgraphdb.materials.core - INFO - Reading materials.\n", "\n", "Materials with ids 1:\n", " material_id elements id\n", "0 mp-2 [Mg, O] 1\n" ] } ], "source": [ "materials_filtered = mgdb.read_materials(\n", " columns=[\"material_id\", \"elements\", \"id\"], ids=[1]\n", ")\n", "print(\"\\nMaterials with ids 1:\")\n", "print(materials_filtered.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Rebuilding the nested structure of the data\n", "\n", "You may have noticed when reading the data some fields are given as `band_gap.value` instead of just `band_gap`. This is because the data is stored in a nested structure. ParquetDB stores the data in a flat structure, so when you read the data, the nested structure is not preserved. You may want to rebuild the nested structure of the data by using the `rebuild_nested_structure` method. This has a cost for the intial read, but subsequent reads will be faster. \n", "\n", "> Note the nested structures are only built the snapshot of the data you read. If you modifiy the data, you will need to rebuild from scratch.\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:23:34 - matgraphdb.materials.core - INFO - Reading materials.\n", "\n", "Materials:\n", " material_id elements band_gap\n", "0 mp-1 [Mg, O] {'unit': None, 'value': None}\n", "1 mp-2 [Mg, O] {'unit': 'eV', 'value': 1.2}\n", "2025-01-11 14:23:34 - matgraphdb.materials.core - INFO - Reading materials.\n", "\n", "Materials:\n", " material_id elements band_gap\n", "0 mp-1 [Mg, O] {'unit': None, 'value': None}\n", "1 mp-2 [Mg, O] {'unit': 'eV', 'value': 1.2}\n" ] } ], "source": [ "# If you rebuild you can pass the parent column name to the `columns` argument.\n", "materials_rebuilt = mgdb.read_materials(\n", " columns=[\"material_id\", \"elements\", \"band_gap\"],\n", " rebuild_nested_struct=True,\n", ")\n", "print(\"\\nMaterials:\")\n", "print(materials_rebuilt.to_pandas())\n", "\n", "\n", "# Building from scratch\n", "materials_rebuilt = mgdb.read_materials(\n", " columns=[\"material_id\", \"elements\", \"band_gap\"],\n", " rebuild_nested_struct=True,\n", " rebuild_nested_from_scratch=True,\n", ")\n", "print(\"\\nMaterials:\")\n", "print(materials_rebuilt.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loading data in batches\n", "\n", "You can also load data in batches. This is useful for loading large amounts of data into memory." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:15:23 - matgraphdb.materials.core - INFO - Reading materials.\n", "\n", "Materials in batches:\n", "Batch 0:\n", " material_id elements id\n", "0 mp-1 [Mg, O] 0\n", "Batch 1:\n", " material_id elements id\n", "0 mp-2 [Mg, O] 1\n" ] } ], "source": [ "materials_generator = mgdb.read_materials(\n", " columns=[\"material_id\", \"elements\", \"id\"], load_format=\"batches\", batch_size=1\n", ")\n", "print(\"\\nMaterials in batches:\")\n", "for i, batch_table in enumerate(materials_generator):\n", " print(f\"Batch {i}:\")\n", " print(batch_table.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Accessing the materials node store\n", "\n", "In the previous examples, we have used utility method of the `MatGraphDB` class to read the materials. However, you can also access the materials node store directly. The `MaterialNodes` class extends the `NodeStore` which extends the `ParquetDB` class, so you can use all the methods available in `ParquetDB` and `NodeStore`." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'matgraphdb.materials.nodes.materials.MaterialNodes'>\n", "(2, 32)\n", "(2, 32)\n", "(2, 32)\n" ] } ], "source": [ "materials_store = mgdb.node_stores[\"materials\"]\n", "materials_store = mgdb.material_nodes\n", "materials_store = mgdb.get_node_store(node_type=\"materials\")\n", "print(type(materials_store))\n", "\n", "# This method is belongs the the MaterialNodes class, which extends the NodeStore class.\n", "materials = materials_store.read_materials()\n", "print(materials.shape)\n", "\n", "# This method is belongs the the NodeStore class.\n", "materials = materials_store.read_nodes()\n", "print(materials.shape)\n", "\n", "\n", "# This method is belongs the the ParquetDB class.\n", "materials = materials_store.read()\n", "print(materials.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Updating Materials\n", "\n", "We can update the materials in the database by calling the `update_materials` method. This method will update the materials with the specified `id`. \n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:41:14 - matgraphdb.materials.core - INFO - Reading materials.\n", "Updated material:\n", " id material_id band_gap.value\n", "0 0 mp-1 NaN\n", "1 1 mp-2 1.2\n", "2025-01-11 14:41:14 - matgraphdb.materials.core - INFO - Updating materials.\n", "2025-01-11 14:41:14 - matgraphdb.materials.nodes.materials - INFO - Updating data\n", "2025-01-11 14:41:14 - matgraphdb.core.nodes - INFO - Updating 1 node records\n", "2025-01-11 14:41:14 - matgraphdb.materials.nodes.materials - INFO - Data updated successfully.\n", "2025-01-11 14:41:14 - matgraphdb.materials.core - INFO - Reading materials.\n", "Updated material:\n", " id material_id band_gap.value\n", "0 0 mp-1 3.6\n", "1 1 mp-2 1.2\n" ] } ], "source": [ "# You can update existing materials in the database:\n", "# Update the band gap of of material with id 0\n", "\n", "# Read the updated material\n", "updated_material = mgdb.read_materials(\n", " columns=[\"id\", \"material_id\", \"band_gap.value\"],\n", ")\n", "\n", "print(\"Updated material:\")\n", "print(updated_material.to_pandas())\n", "\n", "update_data = [\n", " {\n", " \"id\": 0,\n", " \"band_gap\": {\"value\": 3.6, \"unit\": \"eV\"},\n", " },\n", "]\n", "\n", "mgdb.update_materials(update_data)\n", "\n", "# Read the updated material\n", "updated_material = mgdb.read_materials(\n", " columns=[\"id\", \"material_id\", \"band_gap.value\"],\n", ")\n", "\n", "print(\"Updated material:\")\n", "print(updated_material.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Updating on a differnt key\n", "\n", "You can also update on a different key. This is useful if you want to update the materials with the specified `material_id`. You can also update on multiple keys.\n", "\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:43:00 - matgraphdb.materials.core - INFO - Reading materials.\n", "Updated material:\n", " id material_id band_gap.value\n", "0 0 mp-1 3.6\n", "1 1 mp-2 1.2\n", "2025-01-11 14:43:00 - matgraphdb.materials.core - INFO - Updating materials.\n", "2025-01-11 14:43:00 - matgraphdb.materials.nodes.materials - INFO - Updating data\n", "2025-01-11 14:43:00 - matgraphdb.core.nodes - INFO - Updating 1 node records\n", "2025-01-11 14:43:00 - matgraphdb.materials.nodes.materials - INFO - Data updated successfully.\n", "2025-01-11 14:43:00 - matgraphdb.materials.core - INFO - Reading materials.\n", "Updated material:\n", " id material_id band_gap.value\n", "0 0 mp-1 0.1\n", "1 1 mp-2 1.2\n" ] } ], "source": [ "# Read the updated material\n", "updated_material = mgdb.read_materials(\n", " columns=[\"id\", \"material_id\", \"band_gap.value\"],\n", ")\n", "\n", "print(\"Updated material:\")\n", "print(updated_material.to_pandas())\n", "\n", "update_data = [\n", " {\n", " \"material_id\": \"mp-1\",\n", " \"band_gap\": {\"value\": 0.1, \"unit\": \"eV\"},\n", " },\n", "]\n", "\n", "mgdb.update_materials(update_data, update_keys=[\"material_id\"])\n", "\n", "\n", "# Read the updated material\n", "updated_material = mgdb.read_materials(\n", " columns=[\"id\", \"material_id\", \"band_gap.value\"],\n", ")\n", "\n", "print(\"Updated material:\")\n", "print(updated_material.to_pandas())\n", "\n", "\n", "update_data = [\n", " {\n", " \"id\": 1,\n", " \"material_id\": \"mp-1\",\n", " \"band_gap\": {\"value\": 0.1, \"unit\": \"eV\"},\n", " },\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Deleting Materials\n", "\n", "We can delete the materials in the database by calling the `delete_materials` method. This method will delete the materials with the specified `id`. " ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:47:14 - matgraphdb.materials.core - INFO - Deleting materials.\n", "2025-01-11 14:47:14 - matgraphdb.materials.nodes.materials - INFO - Deleting data [0]\n", "2025-01-11 14:47:15 - matgraphdb.materials.nodes.materials - INFO - Data deleted successfully.\n", "2025-01-11 14:47:15 - matgraphdb.materials.core - INFO - Reading materials.\n", "Updated material:\n", " id material_id band_gap.value\n", "0 1 mp-2 1.2\n" ] } ], "source": [ "# Delete the material with id 0\n", "mgdb.delete_materials(ids=[0])\n", "\n", "# Read the updated material\n", "updated_material = mgdb.read_materials(\n", " columns=[\"id\", \"material_id\", \"band_gap.value\"],\n", ")\n", "\n", "print(\"Updated material:\")\n", "print(updated_material.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Deleting columns\n", "\n", "You can also delete columns from the materials node store. This will delete the columns from the database.\n", "\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-11 14:48:15 - matgraphdb.materials.core - INFO - Deleting materials.\n", "2025-01-11 14:48:15 - matgraphdb.materials.nodes.materials - INFO - Deleting data None\n", "2025-01-11 14:48:15 - matgraphdb.materials.nodes.materials - INFO - Data deleted successfully.\n", "2025-01-11 14:48:15 - matgraphdb.materials.core - INFO - Reading materials.\n", "Error loading table: No match for FieldRef.Nested(FieldRef.Name(band_gap) FieldRef.Name(value)) in atomic_numbers: list<element: int64>\n", "band_gap.unit: string\n", "cartesian_coords: list<element: list<element: double>>\n", "density: double\n", "density_atomic: double\n", "elements: list<element: string>\n", "formula: string\n", "frac_coords: list<element: list<element: double>>\n", "id: int64\n", "lattice: extension<arrow.fixed_shape_tensor[value_type=double, shape=[3,3]]>\n", "material_id: string\n", "nelements: int64\n", "nsites: int64\n", "source: string\n", "species: list<element: string>\n", "structure.@class: string\n", "structure.@module: string\n", "structure.charge: int64\n", "structure.lattice.a: double\n", "structure.lattice.alpha: double\n", "structure.lattice.b: double\n", "structure.lattice.beta: double\n", "structure.lattice.c: double\n", "structure.lattice.gamma: double\n", "structure.lattice.matrix: extension<arrow.fixed_shape_tensor[value_type=double, shape=[3,3]]>\n", "structure.lattice.pbc: extension<arrow.fixed_shape_tensor[value_type=bool, shape=[3]]>\n", "structure.lattice.volume: double\n", "structure.sites: list<element: struct<abc: list<element: double>, label: string, properties: struct<dummy_field: int16>, species: list<element: struct<element: string, occu: int64>>, xyz: list<element: double>>>\n", "thermal_conductivity.unit: string\n", "thermal_conductivity.value: double\n", "volume: double\n", "__fragment_index: int32\n", "__batch_index: int32\n", "__last_in_fragment: bool\n", "__filename: string. Returning empty table\n", "Updated material:\n", "Empty DataFrame\n", "Columns: [id, material_id]\n", "Index: []\n" ] } ], "source": [ "# Delete the material with id 0\n", "mgdb.delete_materials(columns=[\"band_gap.value\"])\n", "\n", "# Read the updated material\n", "updated_material = mgdb.read_materials(\n", " columns=[\"id\", \"material_id\", \"band_gap.value\"],\n", ")\n", "\n", "print(\"Updated material:\")\n", "print(updated_material.to_pandas())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This notebook demonstrated the basic usage of MatGraphDB, including creating, reading, updating, and deleting materials. In the next notebook, we will explore a more complex example and see how we can add and query nodes and edges to the database." ] } ], "metadata": { "kernelspec": { "display_name": "matgraphdb_dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 4 }